This is Kara’s attempt to think about patterns of prevalence across sites: What proportion of participants in each site endorsed each question, and can we find “clusters” of questions that were either similar in prevalence across sites or difference across sites in various ways?

This is inspired by Ann’s ideas on 2018-10-08, but uses a somewhat different strategy.

Update 2018-12-17: Now including tests of hypotheses from our meeting on 2018-12-10.

Notes: per our conversation with Nikki, we are dropping one question (#53), which was a repeated question in all sites except for China.

Make new dataset

First, I’m going to make a new dataset where, for each question, we have the proportion of yes reponses from each of the five field sites.

Joining, by = "taves_subj"
Joining, by = "question"
Setting row names on a tibble is deprecated.

Here’s a sample of what this new dataset looks like (5 of the 59 columns):

Please specify format in kable. kableExtra can customize either HTML or LaTeX outputs. See https://haozhu233.github.io/kableExtra/ for details.

Site 10. I have distinct memories of having lived a different life, in a different time and place. 11. I have perceived or interacted with what seemed to be a world or reality other than the one I usually inhabit. 12. I have had the experience of being aware that I was dreaming while asleep. 13. I have sensed the presence of what seemed to be non-ordinary forces or entities. 14. I have had an experience in which the meaning and purpose of my life suddenly became clear to me.
US 0.0961538 0.0776699 0.6310680 0.2549020 0.2788462
Ghana 0.4409449 0.3360000 0.6220472 0.3840000 0.4765625
Thailand 0.0952381 0.0769231 0.8190476 0.2380952 0.4761905
China 0.1279070 0.2808989 0.8395062 0.2159091 0.5308642
Vanuatu 0.6000000 0.3789474 0.7608696 0.3541667 0.8210526

Hierarchical clustering

My first instinct was to try hierarchical clustering on this new dataset. Each question is associated with 5 prevalences (one for each of our 5 fieldsites). In this cluster analysis, we’re looking for questions that share similar patterns of prevalence across the fieldsites - e.g., one cluster might identify a set questions where the prevalence is roughly the same across the 5 sites; another cluster might identify a set of questions where the prevalence is high in Ghana and Vanuatu but low everywhere else; etc. I think this captures some of the spirit of what Ann was after today (10/8)… though maybe not everything (e.g., it might not differentiate between questions where prevalence is high across the 5 sites vs. questions where prevalence is low across the 5 sites).

After some playing around with this, I’m going to extract 9 clusters here - I’ve colored them according such above. This is a subjective call - you could extract more or fewer. I think this seems kind of reasonable eyeballing the plot above.

Joining, by = "question_text"
Joining, by = "question"

Cluster 1

Here I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 1. (Note that the order of clusters doesn’t align with the top to bottom order of the previous plot - sorry if that’s confusing, it’s just an artifact of how the previous plot worked. Not very meaningful. This is the light orange cluster above.)

I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is low in the US and Thailand (~25%), slightly higher in China (but under 50%), and higher in Ghana and Vanuatu (generally over 50%).

I won’t try to interpret the meaning of these questions right now.

Cluster 2

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 2 (the pink cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is something like very low in the US (generally < 10%), pretty low in Thailand and China (generally < 25%), middling in Ghana (around 25%), and moderate in Vanuatu (under 50%). But there are some exceptions here - e.g., #26 (where Thailand is highest); #47 and #39 (where all sites are comparable); #51 (where Ghana, Thailand, and China are all comparable). I would say that these questions are less good examples of this “cluster.”

Cluster 3

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 3 (the dark red cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is generally low (~25%) across the board, but slightly higher in Ghana and Vanuatu (closer to 40%).

Cluster 4

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 4 (the light blue cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is generally comparable and pretty high (~75%) across the board, especially in China and Thailand.

Cluster 5

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 5 (the dark orange cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is generally comparable and moderate (~25%) across the board, but higher in Ghana and Vanuatu (closer to 50%).

Cluster 6

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 6 (the dark green cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is generally comparable and moderate (~25%) in the US and Thailand, but higher in Ghana, China, and especially Vanuatu (>50%).

Cluster 7

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 7 (the dark blue cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is generally moderate (~25%) in the US and Thailand, higher in Ghana and Vanuatu (~50%), and especially high in China (~60%).

Cluster 8

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 8 (the light purple cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is generally moderate (~25%) across the board, maybe a little higher in China and Vanuatu (closer to 40%).

Cluster 9

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 9 (the light green cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is generally high (~50%) across the board, and espcially in China (closer to 70%).

Hierarchical clustering pretending we only had 2 sites (US and Ghana)

Here I will try to approximate what it would be like to do this analysis with only 2 sites, to give a sense of whether this would make sense for the 1.0 data. I’ll limit the data to only 2 sites (US and Ghana) and do everything I did before.

Setting row names on a tibble is deprecated.

Here, 8 seems like a reasonable number of clusters to look at. I wish that the light orange cluster weren’t a cluster of 1, but it requires going down to just 5 clusters to get there, which means that the other clusters get quite large.

Joining, by = "question_text"
Joining, by = "question"

Cluster 1

Following the 5-site analysis above, here I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 1. (This is the pink cluster above.)

I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is moderate in the US (>25%) and higher in Ghana (close to 50%).

Cluster 2

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 2 (the light green cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is low in the US (<10%) and moderate in Ghana (~25%).

Cluster 3

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 3 (the dark green cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is relatively high across the board (>50%).

Cluster 4

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 4 (the dark blue cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is generally moderate across the two sites (~25%), slightly higher in Ghana (~40%).

Cluster 5

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 5 (the dark red cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is generally comparable and moderate (close to 50%) across the board - if anything, perhaps a little higher in the US on a couple of items.

Cluster 6

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 6 (the light blue cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is very low in the US (<10%) and slightly higher in Ghana (~20%).

Cluster 7

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 7 (the dark orange cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

It looks like the prevalence pattern here is fairly low in the US (<25%) and high in Ghana (>50%).

Cluster 8

Now I’ll plot the prevalence of endorsements in each site for each of the questions in Cluster 8 (the other half of the dark green cluster in the overall plot).

Again, I’ll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

This is one item and it happens to be moderate inthe US and high in Ghana. I wouldn’t make much of this factor.

Testing predictions

These are predictions we articulated as a group on 2018-12-10

Clusters 6794 are more common than the others

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: prev ~ level1 + (1 | question) + (1 | taves_ctry)
   Data: hclust_df_grouped
Weights: n

     AIC      BIC   logLik deviance df.resid 
  2922.3   2937.1  -1457.2   2914.3      291 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.3522 -1.3453 -0.1095  1.2750  5.9963 

Random effects:
 Groups     Name        Variance Std.Dev.
 question   (Intercept) 0.1546   0.3932  
 taves_ctry (Intercept) 0.1483   0.3851  
Number of obs: 295, groups:  question, 59; taves_ctry, 5

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.8383     0.1833  -4.574 4.77e-06 ***
level16794    1.0809     0.1163   9.297  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
           (Intr)
level16794 -0.184

Yes, this prediction is upheld (both in the aggregate and looking at each site individually).

Clusters 679 appraised as religious more frequently than cluster 4

Joining, by = "taves_subj"
Joining, by = "question"
Linear mixed model fit by REML ['lmerMod']
Formula: 
appraisal ~ level2common + (1 | question) + (1 | taves_ctry/taves_subj)
   Data: d_appraisal_num2 %>% filter(!is.na(appraisal))

REML criterion at convergence: 11413.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.5831 -0.6775 -0.1177  0.6578  2.7814 

Random effects:
 Groups                Name        Variance Std.Dev.
 taves_subj:taves_ctry (Intercept) 0.16494  0.4061  
 question              (Intercept) 0.01407  0.1186  
 taves_ctry            (Intercept) 0.06927  0.2632  
 Residual                          0.46034  0.6785  
Number of obs: 5141, groups:  
taves_subj:taves_ctry, 521; question, 17; taves_ctry, 5

Fixed effects:
              Estimate Std. Error t value
(Intercept)   -0.20296    0.12379  -1.640
level2common4 -0.08811    0.07898  -1.116

Correlation of Fixed Effects:
            (Intr)
level2cmmn4 -0.116

The results for this prediction are mixed. It is not upheld in the aggregate, but seems to be true in the US, and possibly in Ghana and Vanuatu (i.e., Christian countries?).

Clusters 679 will be marked as positive/negative rather than cluster 4 (?)

Joining, by = "taves_subj"
Joining, by = "question"
Linear mixed model fit by REML ['lmerMod']
Formula: 
abs(valence) ~ level2common + (1 | question) + (1 | taves_ctry/taves_subj)
   Data: d_valence_num2 %>% filter(!is.na(valence))

REML criterion at convergence: 6562.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.1912 -0.9412  0.3295  0.8207  2.2519 

Random effects:
 Groups                Name        Variance Std.Dev.
 taves_subj:taves_ctry (Intercept) 0.027980 0.1673  
 question              (Intercept) 0.014461 0.1203  
 taves_ctry            (Intercept) 0.004901 0.0700  
 Residual                          0.192904 0.4392  
Number of obs: 5077, groups:  
taves_subj:taves_ctry, 521; question, 17; taves_ctry, 5

Fixed effects:
              Estimate Std. Error t value
(Intercept)    0.60997    0.04609  13.233
level2common4 -0.16528    0.07799  -2.119

Correlation of Fixed Effects:
            (Intr)
level2cmmn4 -0.302

Yes, this prediction is upheld (both in the aggregate and looking at each site individually, with the possible exception of Vanuatu).

Cluster 69 will be more positive than cluster 7

Linear mixed model fit by REML ['lmerMod']
Formula: valence ~ level3 + (1 | question) + (1 | taves_ctry/taves_subj)
   Data: d_valence_num2 %>% filter(!is.na(valence))

REML criterion at convergence: 7968.2

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.78548 -0.68151  0.06094  0.73060  2.99054 

Random effects:
 Groups                Name        Variance Std.Dev.
 taves_subj:taves_ctry (Intercept) 0.049447 0.22237 
 question              (Intercept) 0.104683 0.32355 
 taves_ctry            (Intercept) 0.003939 0.06276 
 Residual                          0.398107 0.63096 
Number of obs: 3949, groups:  
taves_subj:taves_ctry, 513; question, 14; taves_ctry, 5

Fixed effects:
            Estimate Std. Error t value
(Intercept) -0.02973    0.12683  -0.234
level369     0.42330    0.17416   2.431

Correlation of Fixed Effects:
         (Intr)
level369 -0.688

Yes, this prediction is clearly upheld (both in the aggregate and looking at each site individually).

Clusters 679 appraised as more significant than 4

Joining, by = "taves_subj"
Joining, by = "question"
Linear mixed model fit by REML ['lmerMod']
Formula: 
significance ~ level2common + (1 | question) + (1 | taves_ctry/taves_subj)
   Data: d_significance_num2 %>% filter(!is.na(significance))

REML criterion at convergence: 3586.9

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.02179 -0.65557  0.03006  0.74054  2.54980 

Random effects:
 Groups                Name        Variance Std.Dev.
 taves_subj:taves_ctry (Intercept) 0.032477 0.18021 
 question              (Intercept) 0.006291 0.07932 
 taves_ctry            (Intercept) 0.001747 0.04179 
 Residual                          0.102079 0.31950 
Number of obs: 5075, groups:  
taves_subj:taves_ctry, 520; question, 17; taves_ctry, 5

Fixed effects:
              Estimate Std. Error t value
(Intercept)    0.59594    0.02989  19.936
level2common4 -0.12958    0.05166  -2.508

Correlation of Fixed Effects:
            (Intr)
level2cmmn4 -0.310

Yes, this prediction is upheld (both in the aggregate and looking at each site individually).

Clusters 851 more common than clusters 32

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: prev ~ level2uncommon + (1 | question) + (1 | taves_ctry)
   Data: hclust_df_grouped
Weights: n

     AIC      BIC   logLik deviance df.resid 
  1791.0   1804.4   -891.5   1783.0      206 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.8760 -1.3721 -0.1709  1.0627  4.8106 

Random effects:
 Groups     Name        Variance Std.Dev.
 question   (Intercept) 0.05679  0.2383  
 taves_ctry (Intercept) 0.20677  0.4547  
Number of obs: 210, groups:  question, 42; taves_ctry, 5

Fixed effects:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -1.30327    0.21510  -6.059 1.37e-09 ***
level2uncommon851  0.66649    0.08514   7.828 4.97e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr)
lvl2ncmm851 -0.268

Yes, this prediction is upheld (both in the aggregate and looking at each site individually, with the probable exception of Vanuatu).

Clusters 8 to 5 to 1: increases in divergence between countries


    Bartlett test of homogeneity of variances

data:  mean_prev by cluster
Bartlett's K-squared = 2.7809, df = 2, p-value = 0.249


    Bartlett test of homogeneity of variances

data:  mean_prev by cluster
Bartlett's K-squared = 0.61953, df = 1, p-value = 0.4312


    Bartlett test of homogeneity of variances

data:  mean_prev by cluster
Bartlett's K-squared = 0.83052, df = 1, p-value = 0.3621

This prediction looks like it’s upheld according to the plot (dots for each site are closest together for cluster 8 and most scattered for cluster 1). It doesn’t come out as significant in a formal test of the variability across the five sites for each cluster, but I’m not sure whether I’d expect that to come out or not, given the sample size (n = 5 sites). (Genuinely not sure!)

Clusters 8 to 5 to 1: decreases in frequency

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: prev ~ cluster + (1 | question) + (1 | taves_ctry)
   Data: hclust_df_grouped %>% filter(cluster %in% c("1", "5", "8")) %>%  
    mutate(cluster = factor(cluster, levels = c("8", "5", "1")))
Weights: n

     AIC      BIC   logLik deviance df.resid 
  1134.8   1149.5   -562.4   1124.8      135 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.2572 -1.0490  0.0215  0.9268  4.3298 

Random effects:
 Groups     Name        Variance Std.Dev.
 question   (Intercept) 0.00329  0.05736 
 taves_ctry (Intercept) 0.14749  0.38405 
Number of obs: 140, groups:  question, 28; taves_ctry, 5

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.66146    0.17311  -3.821 0.000133 ***
cluster.L    0.14281    0.03528   4.047 5.18e-05 ***
cluster.Q    0.33253    0.03879   8.573  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
          (Intr) clst.L
cluster.L -0.021       
cluster.Q -0.016 -0.113

No, this prediction is not upheld (either in the aggregate or looking at each site individually, with the possible exception of Thailand). If anything, cluster 1 generally seems to be the most common of these three clusters.

Clusters 32: GH and VT stand out for their endorsement

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: prev ~ level2uncommon * taves_ctry + (1 | question)
   Data: 
hclust_df_grouped %>% filter(!is.na(level2uncommon)) %>% mutate(taves_ctry = factor(taves_ctry,  
    levels = c("US", "Ghana", "Thailand", "China", "Vanuatu")))
Weights: n

     AIC      BIC   logLik deviance df.resid 
  1713.7   1750.5   -845.8   1691.7      199 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.0518 -1.1072 -0.1083  1.1510  4.9969 

Random effects:
 Groups   Name        Variance Std.Dev.
 question (Intercept) 0.0558   0.2362  
Number of obs: 210, groups:  question, 42

Fixed effects:
                                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)                           -1.35863    0.07055 -19.258  < 2e-16 ***
level2uncommon851                      0.73052    0.08542   8.552  < 2e-16 ***
taves_ctryUS_other                    -0.18649    0.01800 -10.359  < 2e-16 ***
taves_ctryGHVT_THCH                    0.53321    0.03293  16.193  < 2e-16 ***
taves_ctryGH_VT                       -0.30629    0.03869  -7.917 2.43e-15 ***
taves_ctryTH_CH                        0.07194    0.05326   1.351  0.17682    
level2uncommon851:taves_ctryUS_other   0.05719    0.02039   2.805  0.00503 ** 
level2uncommon851:taves_ctryGHVT_THCH -0.25031    0.03843  -6.514 7.34e-11 ***
level2uncommon851:taves_ctryGH_VT      0.13824    0.04666   2.963  0.00305 ** 
level2uncommon851:taves_ctryTH_CH      0.05896    0.06104   0.966  0.33407    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) lv2851 tv_US_ t_GHVT t_GH_V t_TH_C l2851:_U l2851:_GHV
lvl2ncmm851 -0.826                                                       
tvs_ctryUS_  0.086 -0.071                                                
t_GHVT_THCH -0.117  0.097  0.112                                         
tvs_ctGH_VT -0.012  0.010  0.013 -0.035                                  
tvs_ctTH_CH  0.010 -0.008 -0.010 -0.026  0.000                           
lv2851:_US_ -0.076  0.070 -0.883 -0.099 -0.012  0.009                    
l2851:_GHVT  0.100 -0.096 -0.096 -0.857  0.030  0.023  0.098             
l2851:_GH_V  0.010 -0.016 -0.011  0.029 -0.829  0.000  0.018   -0.049    
l2851:_TH_C -0.008  0.010  0.009  0.023  0.000 -0.873 -0.011   -0.029    
            l2851:_GH_
lvl2ncmm851           
tvs_ctryUS_           
t_GHVT_THCH           
tvs_ctGH_VT           
tvs_ctTH_CH           
lv2851:_US_           
l2851:_GHVT           
l2851:_GH_V           
l2851:_TH_C  0.000    

Yes, this prediction is upheld, but rather weakly: Ghana and Vanuatu stand out in both clusters 32 and clusters 851, but they stand out a little more (gap between them and other countries is a little bigger) in clusters 32.

Overall summaries of prevalence, appraisal, valence, & significance

---
title: "KW looking at patterns of prevalence"
date: 2018-10-08
output: 
  html_notebook:
    toc: true
    toc_float: true
---

This is Kara's attempt to think about patterns of prevalence across sites: What proportion of participants in each site endorsed each question, and can we find "clusters" of questions that were either similar in prevalence across sites or difference across sites in various ways?

This is inspired by Ann's ideas on 2018-10-08, but uses a somewhat different strategy.

**Update 2018-12-17**: Now including tests of hypotheses from our meeting on 2018-12-10.

```{r global_options, include = F}
knitr::opts_chunk$set(echo=F, warning=F, cache=F, message=F)
```

```{r, include = F}
library(tidyverse)
library(readxl)
library(psych)
library(factoextra)
library(ggdendro)
library(dendextend)
library(lme4)
library(langcog)
```

```{r, include = F}
d0 <- read_excel("/Users/kweisman/Documents/Research (Stanford)/Projects/Templeton Grant/DATA WRANGLING/Taves/data/Taves_full_dataset.xlsx", sheet = 5)[-1,] # remove question text

question_key <- read_excel("/Users/kweisman/Documents/Research (Stanford)/Projects/Templeton Grant/DATA WRANGLING/Taves/data/Taves_full_dataset.xlsx", sheet = 3)[,1:5] # only relevant columns

n_iter <- 5000
```

```{r, include = F}
d_base <- d0 %>%
  data.frame() %>%
  select(taves_subj, taves_01:taves_60e) %>%
  select(-ends_with("a"), -ends_with("b"), -ends_with("c"), 
         -ends_with("d"), -ends_with("e")) %>%
  distinct() %>%
  filter(taves_subj != "40548") %>% # remove one duplicate
  # column_to_rownames("taves_subj") %>%
  mutate_at(vars(-taves_subj),
            funs(factor(tolower(.), levels = c("no", "yes")))) %>%
  mutate_at(vars(-taves_subj),
            funs(num = as.numeric(.) - 1)) %>%
  column_to_rownames("taves_subj") %>%
  select(-starts_with("taves_53"))

d_base_num <- d_base %>%
  select(ends_with("_num"))
```

Notes: per our conversation with Nikki, we are dropping one question (#53), which was a repeated question in all sites except for China.

```{r, include = F}
d_appraisal <- d0 %>%
  data.frame() %>%
  select(taves_subj, taves_01:taves_60e) %>%
  select(taves_subj, ends_with("b")) %>%
  distinct() %>%
  filter(taves_subj != "40548") %>% # remove one duplicate
  # column_to_rownames("taves_subj") %>%
  mutate_at(vars(-taves_subj),
            funs(factor(tolower(.), levels = c("no", "unsure", "yes")))) %>%
  mutate_at(vars(-taves_subj),
            funs(num = as.numeric(.) - 2)) %>%
  column_to_rownames("taves_subj") %>%
  select(-starts_with("taves_53"))

d_appraisal_num <- d_appraisal %>%
  select(ends_with("_num"))
```

```{r, include = F}
d_valence <- d0 %>%
  data.frame() %>%
  select(taves_subj, taves_01:taves_60e) %>%
  select(taves_subj, ends_with("d")) %>%
  distinct() %>%
  filter(taves_subj != "40548") %>% # remove one duplicate
  # column_to_rownames("taves_subj") %>%
  mutate_at(vars(-taves_subj),
            funs(factor(tolower(.), levels = c("somewhat negative", 
                                               "about neutral", 
                                               "somewhat positive")))) %>%
  mutate_at(vars(-taves_subj),
            funs(num = as.numeric(.) - 2)) %>%
  column_to_rownames("taves_subj") %>%
  select(-starts_with("taves_53"))

d_valence_num <- d_valence %>%
  select(ends_with("_num"))
```

```{r, include = F}
d_significance <- d0 %>%
  data.frame() %>%
  select(taves_subj, taves_01:taves_60e) %>%
  select(taves_subj, ends_with("e")) %>%
  distinct() %>%
  filter(taves_subj != "40548") %>% # remove one duplicate
  # column_to_rownames("taves_subj") %>%
  mutate_at(vars(-taves_subj),
            funs(factor(tolower(.), levels = c("not significant", 
                                               "somewhat significant", 
                                               "very significant")))) %>%
  mutate_at(vars(-taves_subj),
            funs(num = (as.numeric(.) - 1)/2)) %>%
  column_to_rownames("taves_subj") %>%
  select(-starts_with("taves_53"))

d_significance_num <- d_significance %>%
  select(ends_with("_num"))
```

# Make new dataset

First, I'm going to make a new dataset where, for each question, we have the proportion of yes reponses from each of the five field sites.

```{r}
d_prev <- d_base_num %>%
  rownames_to_column("taves_subj") %>%
  gather(question, response, -taves_subj) %>%
  left_join(d0 %>% distinct(taves_subj, taves_ctry)) %>%
  group_by(taves_ctry, question) %>%
  summarise(prev = mean(response, na.rm = T)) %>%
  ungroup() %>%
  mutate(question = gsub("_num", "", question)) %>%
  left_join(question_key %>%
              rename(question = `Variable Name - VERSION 1 -- all variables in version2 have been renamed to reflect these varaible names`,
                     question_text = `Question - VERSION 1`)) %>%
  select(taves_ctry, question, question_text, prev) %>%
  mutate(question_text = gsub("\r", " ", question_text),
         question_text = gsub("\n", " ", question_text),
         question_text = gsub("  ", " ", question_text),
         question_text = gsub("‚Äô", "'", question_text),
         question_text = gsub("‚Äú", "'", question_text),
         question_text = gsub("‚Äù", "'", question_text),
         taves_ctry = factor(taves_ctry,
                             levels = c("US", "Ghana", "Thailand", 
                                        "China", "Vanuatu")))


d_prev_wide <- d_prev %>%
  mutate(question = factor(question)) %>%
  arrange(question) %>%
  select(-question) %>%
  spread(question_text, prev) %>%
  mutate(taves_ctry = as.character(taves_ctry)) %>%
  column_to_rownames("taves_ctry")
```

Here's a sample of what this new dataset looks like (5 of the 59 columns):

```{r, results = "asis"}
d_prev_wide %>% 
  select(2:6) %>% 
  rownames_to_column("Site") %>% 
  knitr::kable() %>% 
  kableExtra::kable_styling()
```


<P style="page-break-before: always">
# Hierarchical clustering

My first instinct was to try hierarchical clustering on this new dataset. Each question is associated with 5 prevalences (one for each of our 5 fieldsites). In this cluster analysis, we're looking for questions that share similar patterns of prevalence across the fieldsites - e.g., one cluster might identify a set questions where the prevalence is roughly the same across the 5 sites; another cluster might identify a set of questions where the prevalence is high in Ghana and Vanuatu but low everywhere else; etc. I think this captures some of the spirit of what Ann was after today (10/8)... though maybe not everything (e.g., it might not differentiate between questions where prevalence is high across the 5 sites vs. questions where prevalence is low across the 5 sites).

```{r}
hclust_prev <- d_prev_wide %>% 
  t() %>% 
  dist() %>% 
  hclust()
```

```{r, fig.width = 8, fig.asp = 1}
hclust_prev %>% 
  as.dendrogram() %>% 
  set("labels_col", k = 9, value = c("#a6cee3", "#1f78b4", "#b2df8a",
                                     "#33a02c", "#fb9a99", "#e31a1c", 
                                     "#fdbf6f", "#ff7f00", "#cab2d6")) %>% 
  plot(horiz = T, xlim = c(1.5, -12), axes = F)
```

After some playing around with this, I'm going to extract 9 clusters here - I've colored them according such above. This is a subjective call - you could extract more or fewer. I think this seems kind of reasonable eyeballing the plot above.

```{r}
hclust_df <- data.frame(cluster = cutree(hclust_prev, 9)) %>%
  rownames_to_column("question_text") %>%
  full_join(d_prev) %>%
  left_join(d_prev %>% 
              filter(taves_ctry == "US") %>% 
              distinct(question, prev) %>% 
              rename("US_prev" = "prev")) %>%
  arrange(cluster, desc(US_prev), taves_ctry)
```

<P style="page-break-before: always">
## Cluster 1

Here I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 1. (Note that the order of clusters doesn't align with the top to bottom order of the previous plot - sorry if that's confusing, it's just an artifact of how the previous plot worked. Not very meaningful. This is the light orange cluster above.)

I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 1}
hclust_df %>%
  filter(cluster == 1) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 1",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is low in the US and Thailand (~25%), slightly higher in China (but under 50%), and higher in Ghana and Vanuatu (generally over 50%).

I won't try to interpret the meaning of these questions right now.


<P style="page-break-before: always">
## Cluster 2

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 2 (the pink cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.5}
hclust_df %>%
  filter(cluster == 2) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 2",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is something like very low in the US (generally < 10%), pretty low in Thailand and China (generally < 25%), middling in Ghana (around 25%), and moderate in Vanuatu (under 50%). But there are some exceptions here - e.g., #26 (where Thailand is highest); #47 and #39 (where all sites are comparable); #51 (where Ghana, Thailand, and China are all comparable). I would say that these questions are less good examples of this "cluster."


<P style="page-break-before: always">
## Cluster 3

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 3 (the dark red cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.7}
hclust_df %>%
  filter(cluster == 3) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 3",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is generally low (~25%) across the board, but slightly higher in Ghana and Vanuatu (closer to 40%).


<P style="page-break-before: always">
## Cluster 4

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 4 (the light blue cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.3}
hclust_df %>%
  filter(cluster == 4) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 4",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is generally comparable and pretty high (~75%) across the board, especially in China and Thailand.


<P style="page-break-before: always">
## Cluster 5

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 5 (the dark orange cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.6}
hclust_df %>%
  filter(cluster == 5) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 5",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is generally comparable and moderate (~25%) across the board, but higher in Ghana and Vanuatu (closer to 50%).


<P style="page-break-before: always">
## Cluster 6

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 6 (the dark green cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.3}
hclust_df %>%
  filter(cluster == 6) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 6",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is generally comparable and moderate (~25%) in the US and Thailand, but higher in Ghana, China, and especially Vanuatu (>50%).


<P style="page-break-before: always">
## Cluster 7

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 7 (the dark blue cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.7}
hclust_df %>%
  filter(cluster == 7) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 7",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is generally moderate (~25%) in the US and Thailand, higher in Ghana and Vanuatu (~50%), and especially high in China (~60%).


<P style="page-break-before: always">
## Cluster 8

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 8 (the light purple cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.7}
hclust_df %>%
  filter(cluster == 8) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 8",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is generally moderate (~25%) across the board, maybe a little higher in China and Vanuatu (closer to 40%).


<P style="page-break-before: always">
## Cluster 9

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 9 (the light green cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.4}
hclust_df %>%
  filter(cluster == 9) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 9",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is generally high (~50%) across the board, and espcially in China (closer to 70%).


# Hierarchical clustering pretending we only had 2 sites (US and Ghana)

Here I will try to approximate what it would be like to do this analysis with only 2 sites, to give a sense of whether this would make sense for the 1.0 data. I'll limit the data to only 2 sites (US and Ghana) and do everything I did before.

```{r}
hclust_prev2 <- d_prev_wide %>% 
  rownames_to_column("taves_ctry") %>%
  filter(taves_ctry %in% c("US", "Ghana")) %>%
  column_to_rownames("taves_ctry") %>%
  t() %>% 
  dist() %>% 
  hclust()
```

```{r, fig.width = 8, fig.asp = 1}
hclust_prev2 %>% 
  as.dendrogram() %>% 
  set("labels_col", k = 8, value = c("#a6cee3", "#1f78b4", "#b2df8a",
                                     "#33a02c", "#fb9a99", "#e31a1c", 
                                     "#fdbf6f", "#ff7f00", "#cab2d6")) %>% 
  plot(horiz = T, xlim = c(1.5, -12), axes = F)
```

Here, 8 seems like a reasonable number of clusters to look at. I wish that the light orange cluster weren't a cluster of 1, but it requires going down to just 5 clusters to get there, which means that the other clusters get quite large.

```{r}
hclust_df2 <- data.frame(cluster = cutree(hclust_prev2, 8)) %>%
  rownames_to_column("question_text") %>%
  full_join(d_prev %>% filter(taves_ctry %in% c("US", "Ghana"))) %>%
  left_join(d_prev %>% 
              filter(taves_ctry == "US") %>% 
              distinct(question, prev) %>% 
              rename("US_prev" = "prev")) %>%
  arrange(cluster, desc(US_prev), taves_ctry)
```

## Cluster 1

Following the 5-site analysis above, here I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 1. (This is the pink cluster above.)

I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 1}
hclust_df2 %>%
  filter(cluster == 1) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 1",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is moderate in the US (>25%) and higher in Ghana (close to 50%).


<P style="page-break-before: always">
## Cluster 2

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 2 (the light green cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.5}
hclust_df2 %>%
  filter(cluster == 2) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 2",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is low in the US (<10%) and moderate in Ghana (~25%).


<P style="page-break-before: always">
## Cluster 3

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 3 (the dark green cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.7}
hclust_df2 %>%
  filter(cluster == 3) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 3",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is relatively high across the board (>50%).


<P style="page-break-before: always">
## Cluster 4

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 4 (the dark blue cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 1}
hclust_df2 %>%
  filter(cluster == 4) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 4",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is generally moderate across the two sites (~25%), slightly higher in Ghana (~40%).


<P style="page-break-before: always">
## Cluster 5

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 5 (the dark red cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.6}
hclust_df2 %>%
  filter(cluster == 5) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 5",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is generally comparable and moderate (close to 50%) across the board - if anything, perhaps a little higher in the US on a couple of items.


<P style="page-break-before: always">
## Cluster 6

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 6 (the light blue cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.4}
hclust_df2 %>%
  filter(cluster == 6) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 6",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is very low in the US (<10%) and slightly higher in Ghana (~20%).

<P style="page-break-before: always">
## Cluster 7

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 7 (the dark orange cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.4}
hclust_df2 %>%
  filter(cluster == 7) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 7",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

It looks like the prevalence pattern here is fairly low in the US (<25%) and high in Ghana (>50%).


<P style="page-break-before: always">
## Cluster 8

Now I'll plot the prevalence of endorsements in each site for each of the questions in Cluster 8 (the other half of the dark green cluster in the overall plot).

Again, I'll plot these questions in descending order of their prevalence in the US sample, just as a point of reference.

```{r, fig.width = 5, fig.asp = 0.2}
hclust_df2 %>%
  filter(cluster == 8) %>%
  mutate(question_text = gsub('(.{1,40})(\\s|$)', '\\1\n', question_text)) %>%
  ggplot(aes(x = reorder(question_text, US_prev), y = prev,
             fill = taves_ctry, color = taves_ctry)) +
  facet_grid(~ taves_ctry, scales = "free") +
  # geom_hline(yintercept = 0.5, lty = 2) +
  geom_bar(stat = "identity", position = "identity", size = 1, alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2", guide = "none") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.5)) +
  labs(title = "Cluster: 8",
       x = "", y = "proportion saying YES") +
  theme_bw() +
  coord_flip()
```

This is one item and it happens to be moderate inthe US and high in Ghana. I wouldn't make much of this factor.

<P style="page-break-before: always">

```{r, include = F}
# Comparing cluster analyses
full_join(hclust_df %>%
            distinct(cluster, question, question_text) %>%
            rename(cluster_5sites = cluster),
          hclust_df2 %>%
            distinct(cluster, question) %>%
            rename(cluster_2sites = cluster)) %>%
  select(question, question_text, cluster_5sites, cluster_2sites) %>%
  count(cluster_5sites, cluster_2sites) %>%
  group_by(cluster_5sites) %>%
  mutate(prop = n/sum(n)) %>%
  top_n(1, prop) %>% 
  arrange(desc(prop))
```

```{r, include = F}
hclust_prev3 <- d_prev_wide %>% 
  rownames_to_column("taves_ctry") %>%
  filter(!taves_ctry %in% c("US", "Ghana")) %>%
  column_to_rownames("taves_ctry") %>%
  t() %>% 
  dist() %>% 
  hclust()
```

```{r, include = F}
hclust_df3 <- data.frame(cluster = cutree(hclust_prev3, 8)) %>%
  rownames_to_column("question_text") %>%
  full_join(d_prev %>% filter(!taves_ctry %in% c("US", "Ghana"))) %>%
  left_join(d_prev %>% 
              filter(taves_ctry == "US") %>% 
              distinct(question, prev) %>% 
              rename("US_prev" = "prev")) %>%
  arrange(cluster, desc(US_prev), taves_ctry)
```

```{r, include = F}
full_join(hclust_df2 %>%
            distinct(cluster, question, question_text) %>%
            rename(cluster_USGH = cluster),
          hclust_df3 %>%
            distinct(cluster, question) %>%
            rename(cluster_THCHVT = cluster)) %>%
  select(question, question_text, cluster_USGH, cluster_THCHVT) %>%
  count(cluster_USGH, cluster_THCHVT) %>%
  group_by(cluster_USGH) %>%
  mutate(prop = n/sum(n)) %>%
  top_n(1, prop) %>% 
  arrange(desc(prop))
```


# Testing predictions

These are predictions we articulated as a group on 2018-12-10

```{r, include = F}
hclust_df_grouped <- hclust_df %>%
  mutate(level1 = case_when(cluster %in% c(6, 9, 7, 4) ~ "6794",
                            cluster %in% c(8, 5, 1, 3, 2) ~ "85132"),
         level2uncommon = case_when(cluster %in% c(8, 5, 1) ~ "851",
                                    cluster %in% c(3, 2) ~ "32",
                                    TRUE ~ NA_character_),
         level2common = case_when(cluster %in% c(6, 9, 7) ~ "679",
                                  cluster %in% c(4) ~ "4",
                                  TRUE ~ NA_character_),
         level3 = case_when(cluster %in% c(6, 9) ~ "69",
                            cluster %in% c(7) ~ "7")) %>%
  mutate(level1 = factor(level1, levels = c("85132", "6794")),
         level2uncommon = factor(level2uncommon, levels = c("32", "851")),
         level2common = factor(level2common, levels = c("679", "4")),
         level3 = factor(level3, levels = c("7", "69")))
```

```{r, include = F}
hclust_df_grouped <- hclust_df_grouped %>%
  full_join(d_base %>%
              rownames_to_column("taves_subj") %>%
              gather(question, response, -taves_subj) %>%
              full_join(d0 %>% distinct(taves_ctry, taves_subj)) %>%
              filter(!is.na(response), !grepl("num", question)) %>%
              count(taves_ctry, question))
```


## Clusters 6794 are more common than the others

```{r}
glmer(prev ~ level1 + (1|question) + (1|taves_ctry),
      hclust_df_grouped,
      family = "binomial", weights = n) %>% 
  summary()
```

```{r}
hclust_df_grouped %>%
  group_by(level1, taves_ctry) %>%
  multi_boot_standard("prev") %>%
  ungroup() %>%
  mutate(taves_ctry = factor(taves_ctry, 
                             levels = c("US", "Ghana", "Thailand", 
                                        "China", "Vanuatu"))) %>%
  ggplot(aes(x = level1, y = mean, color = taves_ctry)) + 
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = 0.5)) +
  ylim(0, 1) +
  scale_color_brewer(palette = "Dark2") +
  theme_bw() +
  labs(title = "Prediction: Clusters 6794 are more common than the others",
       x = "Cluster group (level 1 divide)", y = "Prevalence (0-1)",
       color = "Site")
```

Yes, this prediction is upheld (both in the aggregate and looking at each site individually).


<P style="page-break-before: always">
## Clusters 679 appraised as religious more frequently than cluster 4

```{r}
d_appraisal_num2 <- d_appraisal_num %>%
        rownames_to_column("taves_subj") %>%
        full_join(d0 %>% distinct(taves_subj, taves_ctry)) %>%
        gather(question, appraisal, -c(taves_subj, taves_ctry)) %>%
        mutate(question = gsub("b_num", "", question)) %>%
        full_join(hclust_df_grouped %>% 
                    select(question, cluster, starts_with("level")) %>%
                    distinct()) %>%
  filter(!is.na(appraisal)) %>%
  distinct()
```

```{r}
lmer(appraisal ~ level2common + (1|question) + (1|taves_ctry/taves_subj),
     d_appraisal_num2 %>% filter(!is.na(appraisal))) %>% 
  summary()
```

```{r}
d_appraisal_num2 %>%
  filter(!is.na(level2common)) %>%
  group_by(level2common, taves_ctry) %>%
  multi_boot_standard("appraisal") %>%
  ungroup() %>%
  mutate(taves_ctry = factor(taves_ctry, 
                             levels = c("US", "Ghana", "Thailand", 
                                        "China", "Vanuatu"))) %>%
  ggplot(aes(x = level2common, y = mean, color = taves_ctry)) +
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = 0.5)) +
  ylim(-1, 1) +
  scale_color_brewer(palette = "Dark2") +
  theme_bw() +
  labs(title = "Prediction: Clusters 679 appraised as religious more frequently\nthan cluster 4",
       x = "Cluster group (level 2 divide, among 'common' clusters only)", 
       y = "Appraisal (-1 to +1)",
       color = "Site")
```

```{r}
d_appraisal_num2 %>%
  filter(!is.na(level2common)) %>%
  group_by(level2common, taves_ctry) %>%
  multi_boot_standard("appraisal") %>%
  ungroup() %>%
  mutate(taves_ctry = factor(taves_ctry, 
                             levels = c("US", "Ghana", "Thailand", 
                                        "China", "Vanuatu"))) %>%
  ggplot(aes(x = level2common, 
             y = mean, color = taves_ctry)) + 
  facet_grid(~ taves_ctry) +
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = 0.5)) +
  ylim(-1, 1) +
  scale_color_brewer(palette = "Dark2") +
  theme_bw() +
  labs(title = "Prediction: Clusters 679 appraised as religious more frequently\nthan cluster 4",
       x = "Cluster group (level 2 divide, among 'common' clusters only)", 
       y = "Appraisal (-1 to +1)",
       color = "Site")
```

The results for this prediction are mixed. It is *not* upheld in the aggregate, but seems to be true in the US, and possibly in Ghana and Vanuatu (i.e., Christian countries?).


<P style="page-break-before: always">
## Clusters 679 will be marked as positive/negative rather than cluster 4 (?)

```{r}
d_valence_num2 <- d_valence_num %>%
  rownames_to_column("taves_subj") %>%
  full_join(d0 %>% distinct(taves_subj, taves_ctry)) %>%
  gather(question, valence, -c(taves_subj, taves_ctry)) %>%
  mutate(question = gsub("d_num", "", question)) %>%
  full_join(hclust_df_grouped %>%
              select(question, cluster, starts_with("level")) %>%
              distinct()) %>%
  filter(!is.na(valence)) %>%
  distinct()
```

```{r}
lmer(abs(valence) ~ level2common + (1|question) + (1|taves_ctry/taves_subj),
     d_valence_num2 %>% filter(!is.na(valence))) %>% 
  summary()
```

```{r}
d_valence_num2 %>%
  filter(!is.na(level2common)) %>%
  group_by(level2common, taves_ctry) %>%
  mutate(valence = abs(valence)) %>%
  multi_boot_standard("valence") %>%
  ungroup() %>%
  mutate(taves_ctry = factor(taves_ctry, 
                             levels = c("US", "Ghana", "Thailand", 
                                        "China", "Vanuatu"))) %>%
  ggplot(aes(x = level2common, y = mean, color = taves_ctry)) +
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = 0.5)) +
  ylim(0, 1) +
  scale_color_brewer(palette = "Dark2") +
  theme_bw() +
  labs(title = "Prediction: Clusters 679 will be marked as\npositive/negative rather than cluster 4",
       x = "Cluster group (level 2 divide, among 'common' clusters only)", 
       y = "Absolute valence (0 to 1)",
       color = "Site")
```

Yes, this prediction is upheld (both in the aggregate and looking at each site individually, with the possible exception of Vanuatu).


<P style="page-break-before: always">
## Cluster 69 will be more positive than cluster 7

```{r}
lmer(valence ~ level3 + (1|question) + (1|taves_ctry/taves_subj),
     d_valence_num2 %>% filter(!is.na(valence))) %>% 
  summary()
```

```{r}
d_valence_num2 %>%
  filter(!is.na(level3)) %>%
  group_by(level3, taves_ctry) %>%
  multi_boot_standard("valence") %>%
  ungroup() %>%
  mutate(taves_ctry = factor(taves_ctry, 
                             levels = c("US", "Ghana", "Thailand", 
                                        "China", "Vanuatu"))) %>%
  ggplot(aes(x = level3, y = mean, color = taves_ctry)) +
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = 0.5)) +
  ylim(-1, 1) +
  scale_color_brewer(palette = "Dark2") +
  theme_bw() +
  labs(title = "Prediction: Clusters 69 appraised as more\npositive than 7",
       x = "Cluster group (level 3 divide, ...)", 
       y = "Valence (-1 to 1)",
       color = "Site")
```

Yes, this prediction is clearly upheld (both in the aggregate and looking at each site individually).


<P style="page-break-before: always">
## Clusters 679 appraised as more significant than 4

```{r}
d_significance_num2 <- d_significance_num %>%
        rownames_to_column("taves_subj") %>%
        full_join(d0 %>% distinct(taves_subj, taves_ctry)) %>%
        gather(question, significance, -c(taves_subj, taves_ctry)) %>%
        mutate(question = gsub("e_num", "", question)) %>%
        full_join(hclust_df_grouped %>% 
                    select(question, cluster, starts_with("level")) %>%
                    distinct()) %>%
  filter(!is.na(significance)) %>%
  distinct()
```

```{r}
lmer(significance ~ level2common + (1|question) + (1|taves_ctry/taves_subj),
     d_significance_num2 %>% filter(!is.na(significance))) %>% 
  summary()
```

```{r}
d_significance_num2 %>%
  filter(!is.na(level2common)) %>%
  group_by(level2common, taves_ctry) %>%
  multi_boot_standard("significance") %>%
  ungroup() %>%
  mutate(taves_ctry = factor(taves_ctry, 
                             levels = c("US", "Ghana", "Thailand", 
                                        "China", "Vanuatu"))) %>%
  ggplot(aes(x = level2common, y = mean, color = taves_ctry)) +
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = 0.5)) +
  ylim(0, 1) +
  scale_color_brewer(palette = "Dark2") +
  theme_bw() +
  labs(title = "Prediction: Clusters 679 appraised as more\nsignificant than 4",
       x = "Cluster group (level 2 divide, among 'common' clusters only)", 
       y = "Significance (0 to 1)",
       color = "Site")
```

Yes, this prediction is upheld (both in the aggregate and looking at each site individually).


<P style="page-break-before: always">
## Clusters 851 more common than clusters 32

```{r}
glmer(prev ~ level2uncommon + (1|question) + (1|taves_ctry),
      hclust_df_grouped,
      family = "binomial", weights = n) %>% 
  summary()
```

```{r}
hclust_df_grouped %>%
  filter(!is.na(level2uncommon)) %>%
  group_by(level2uncommon, taves_ctry) %>%
  multi_boot_standard("prev") %>%
  ungroup() %>%
  mutate(taves_ctry = factor(taves_ctry, 
                             levels = c("US", "Ghana", "Thailand", 
                                        "China", "Vanuatu"))) %>%
  ggplot(aes(x = level2uncommon, y = mean, color = taves_ctry)) + 
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = 0.5)) +
  ylim(0, 1) +
  scale_color_brewer(palette = "Dark2") +
  theme_bw() +
  labs(title = "Prediction: Clusters 851 more common than clusters 32",
       x = "Cluster group (level 2 divide, among 'uncommon' clusters only)", 
       y = "Prevalence (0-1)", color = "Site")
```

Yes, this prediction is upheld (both in the aggregate and looking at each site individually, with the probable exception of Vanuatu).


<P style="page-break-before: always">
## Clusters 8 to 5 to 1: increases in divergence between countries

```{r}
hclust_df_grouped %>%
  filter(cluster %in% c("1", "5", "8")) %>% 
  group_by(cluster, taves_ctry) %>%
  summarise(mean_prev = mean(prev, na.rm = T)) %>%
  group_by(cluster) %>%
  summarise(sd = sd(mean_prev))

bartlett.test(mean_prev ~ cluster, 
              hclust_df_grouped %>%
                filter(cluster %in% c("1", "5", "8")) %>%
                group_by(cluster, taves_ctry) %>%
                summarise(mean_prev = mean(prev, na.rm = T)))

bartlett.test(mean_prev ~ cluster, 
              hclust_df_grouped %>%
                filter(cluster %in% c("1", "5")) %>%
                group_by(cluster, taves_ctry) %>%
                summarise(mean_prev = mean(prev, na.rm = T)))

bartlett.test(mean_prev ~ cluster, 
              hclust_df_grouped %>%
                filter(cluster %in% c("5", "8")) %>%
                group_by(cluster, taves_ctry) %>%
                summarise(mean_prev = mean(prev, na.rm = T)))
```

```{r}
hclust_df_grouped %>%
  filter(cluster %in% c("1", "5", "8")) %>%
  group_by(cluster, taves_ctry) %>%
  multi_boot_standard("prev") %>%
  ungroup() %>%
  mutate(taves_ctry = factor(taves_ctry, 
                             levels = c("US", "Ghana", "Thailand", 
                                        "China", "Vanuatu"))) %>%
  ggplot(aes(x = factor(cluster, levels = c("8", "5", "1")), 
             y = mean, color = taves_ctry)) + 
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = 0.5)) +
  ylim(0, 1) +
  scale_color_brewer(palette = "Dark2") +
  theme_bw() +
  labs(title = "Prediction: Clusters 8 to 5 to 1: increases in divergence\nbetween countries",
       x = "Cluster group (level 2 divide, among 'uncommon' clusters only)", 
       y = "Prevalence (0-1)", color = "Site")
```

This prediction looks like it's upheld according to the plot (dots for each site are closest together for cluster 8 and most scattered for cluster 1). It doesn't come out as significant in a formal test of the variability across the five sites for each cluster, but I'm not sure whether I'd expect that to come out or not, given the sample size (n = 5 sites). (Genuinely not sure!)


<P style="page-break-before: always">
## Clusters 8 to 5 to 1: decreases in frequency

```{r}
glmer(prev ~ cluster + (1|question) + (1|taves_ctry),
      hclust_df_grouped %>%
        filter(cluster %in% c("1", "5", "8")) %>%
        mutate(cluster = factor(cluster, levels = c("8", "5", "1"))),
      contrasts = list(cluster = contr.poly(3)),
      family = "binomial", weights = n) %>% 
  summary()
```

```{r}
hclust_df_grouped %>%
  filter(cluster %in% c("1", "5", "8")) %>%
  group_by(cluster, taves_ctry) %>%
  multi_boot_standard("prev") %>%
  ungroup() %>%
  mutate(taves_ctry = factor(taves_ctry, 
                             levels = c("US", "Ghana", "Thailand", 
                                        "China", "Vanuatu"))) %>%
  ggplot(aes(x = factor(cluster, levels = c("8", "5", "1")), 
             y = mean, color = taves_ctry)) + 
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = 0.5)) +
  ylim(0, 1) +
  scale_color_brewer(palette = "Dark2") +
  theme_bw() +
  labs(title = "Prediction: Clusters 8 to 5 to 1: decreases in frequency",
       x = "Cluster group (level 2 divide, among 'uncommon' clusters only)", 
       y = "Prevalence (0-1)", color = "Site")
```

No, this prediction is *not* upheld (either in the aggregate or looking at each site individually, with the possible exception of Thailand). If anything, cluster 1 generally seems to be the most common of these three clusters.


<P style="page-break-before: always">
## Clusters 32: GH and VT stand out for their endorsement

```{r}
glmer(prev ~ level2uncommon * taves_ctry + (1|question),
      hclust_df_grouped %>%
        filter(!is.na(level2uncommon)) %>%
        mutate(taves_ctry = factor(taves_ctry,
                                   levels = c("US", "Ghana", "Thailand",
                                              "China", "Vanuatu"))),
      contrasts = list(taves_ctry = cbind("US_other" = c(4, -1, -1, -1, -1),
                                          "GHVT_THCH" = c(0, 1, -1, -1, 1),
                                          "GH_VT" = c(0, 1, 0, 0, -1),
                                          "TH_CH" = c(0, 0, -1, 1, 0))),
      family = "binomial", weights = n) %>% 
  summary()
```

```{r}
hclust_df_grouped %>%
  filter(!is.na(level2uncommon)) %>%
  group_by(level2uncommon, taves_ctry) %>%
  multi_boot_standard("prev") %>%
  ungroup() %>%
  mutate(taves_ctry = factor(taves_ctry, 
                             levels = c("US", "Ghana", "Thailand", 
                                        "China", "Vanuatu")),
         GHVT_oth = case_when(taves_ctry %in% c("Ghana", "Vanuatu") ~ "GH & VT",
                              TRUE ~ "other")) %>%
  ggplot(aes(x = level2uncommon, y = mean, color = taves_ctry, shape = GHVT_oth)) + 
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = 0.5)) +
  ylim(0, 1) +
  scale_color_brewer(palette = "Dark2") +
  theme_bw() +
  labs(title = "Prediction: Clusters 32: GH and VT stand out for\ntheir endorsement",
       x = "Cluster group (level 2 divide, among 'uncommon' clusters only)", 
       y = "Prevalence (0-1)", color = "Site", shape = "Ghana/Vanuatu?")
```

Yes, this prediction is upheld, but rather weakly: Ghana and Vanuatu stand out in both clusters 32 and clusters 851, but they stand out a little more (gap between them and other countries is a little bigger) in clusters 32.


<P style="page-break-before: always">
# Overall summaries of prevalence, appraisal, valence, & significance

```{r, include = F}
df_prevalence <- d_base_num %>%
  rownames_to_column("taves_subj") %>%
  full_join(d0 %>% select(taves_subj, taves_ctry)) %>%
  gather(question, endorsement, -c(taves_subj, taves_ctry)) %>%
  mutate(question = gsub("_num", "", question)) %>%
  filter(!is.na(endorsement)) %>%
  group_by(taves_ctry, question) %>%
  mutate(prev_n = n()) %>%
  group_by(prev_n, add = T) %>%
  multi_boot_standard("endorsement") %>%
  ungroup() %>%
  rename(prev_ci_lower = ci_lower,
         prev_ci_upper = ci_upper,
         prev_mean = mean)

df_appraisal <- d_appraisal_num2 %>%
  mutate(question = gsub("._num", "", question)) %>%
  filter(!is.na(appraisal)) %>%
  group_by(taves_ctry, level1, level2uncommon, level2common, level3, 
           cluster, question) %>%
  mutate(app_n = n()) %>%
  group_by(app_n, add = T) %>%
  multi_boot_standard("appraisal") %>%
  ungroup() %>%
  rename(app_ci_lower = ci_lower,
         app_ci_upper = ci_upper,
         app_mean = mean)

df_valence <- d_valence_num2 %>%
  mutate(question = gsub("._num", "", question)) %>%
  filter(!is.na(valence)) %>%
  group_by(taves_ctry, level1, level2uncommon, level2common, level3, 
           cluster, question) %>%
  mutate(val_n = n()) %>%
  group_by(val_n, add = T) %>%
  multi_boot_standard("valence") %>%
  ungroup() %>%
  rename(val_ci_lower = ci_lower,
         val_ci_upper = ci_upper,
         val_mean = mean)

df_significance <- d_significance_num2 %>%
  mutate(question = gsub("._num", "", question)) %>%
  filter(!is.na(significance)) %>%
  group_by(taves_ctry, level1, level2uncommon, level2common, level3, 
           cluster, question) %>%
  mutate(sig_n = n()) %>%
  group_by(sig_n, add = T) %>%
  multi_boot_standard("significance") %>%
  ungroup() %>%
  rename(sig_ci_lower = ci_lower,
         sig_ci_upper = ci_upper,
         sig_mean = mean)

df_all_resp <- hclust_df_grouped %>% 
  distinct(question, question_text, US_prev,  
           level1, level2uncommon, level2common, level3, cluster) %>%
  rename(prev_US = US_prev) %>%
  full_join(df_prevalence) %>%
  full_join(df_appraisal) %>%
  full_join(df_valence) %>%
  full_join(df_significance) %>%
  mutate(level2 = ifelse(is.na(level2uncommon), 
                         as.character(level2common), 
                         as.character(level2uncommon)),
         cluster = factor(cluster,
                          levels = c("8", "5", "1", "3", "2",
                                     "6", "9", "7", "4"),
                          labels = c("cluster 8", "cluster 5", "cluster 1",
                                     "cluster 3", "cluster 2", "cluster 6",
                                     "cluster 9", "cluster 7 ", "cluster 4")),
         level1 = factor(level1,
                         levels = c("85132", "6794"),
                         labels = c("L1: 85132", "L1: 6794")),
         level2 = factor(level2,
                         levels = c("851", "32", "679", "4"),
                         labels = c("L2: 851", "L2: 32", "L2: 679", "L2: 4")),
         taves_ctry = factor(taves_ctry,
                             levels = c("US", "Ghana", "Thailand",
                                        "China", "Vanuatu"))) %>%
  select(taves_ctry, level1, level2, level2uncommon, level2common, level3, 
         cluster, question, question_text, prev_US, 
         starts_with("prev"), starts_with("app"), starts_with("sig"),
         starts_with("val"))
```

```{r, fig.width = 6, fig.asp = 0.8}
df_all_resp %>%
  ggplot(aes(x = question, y = prev_mean, color = taves_ctry, label = prev_n)) +
  facet_grid(cols = vars(level1, level2, cluster), 
             rows = vars(taves_ctry), 
             scales = "free_x", space = "free_x") +
  geom_hline(yintercept = 0.5, lty = 2) +
  geom_pointrange(aes(ymin = prev_ci_lower, ymax = prev_ci_upper)) +
  geom_text(size = 2, y = 0.05, color = "black") +
  scale_color_brewer(palette = "Set2") +
  ylim(0, 1) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        legend.position = "none") +
  labs(title = "PREVALENCE: Proportion of participants who endorsed experience",
       subtitle = "By site (rows), question (x-axis), and levels 1-2 grouping of clusters (columns)\nNumbers for each question indicate the total number of participants from that site who responded to that question; error bars are 95% bootstrapped CIs",
       x = "Question",
       y = "Prevalence (0-1)", color = "Site")
```

<P style="page-break-before: always">
```{r, fig.width = 6, fig.asp = 0.8}
df_all_resp %>%
  ggplot(aes(x = question, y = app_mean, color = taves_ctry, label = app_n)) +
  facet_grid(cols = vars(level1, level2, cluster), 
             rows = vars(taves_ctry), 
             scales = "free_x", space = "free_x") +
  geom_hline(yintercept = 0, lty = 2) +
  geom_pointrange(aes(ymin = app_ci_lower, ymax = app_ci_upper)) +
  geom_text(size = 2, y = -0.95, color = "black") +
  scale_color_brewer(palette = "Set2") +
  ylim(-1, 1) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        legend.position = "none") +
  labs(title = "APPRAISAL: Mean response to whether the participant appraised experience as religious/spiritual",
       subtitle = "By site (rows), question (x-axis), and levels 1-2 grouping of clusters (columns)\nNumbers for each question indicate the total number of participants from that site who responded to that question; error bars are 95% bootstrapped CIs",
       x = "Question",
       y = "Appraisal (-1: no, 0: unsure, +1: yes)", color = "Site")
```

<P style="page-break-before: always">
```{r, fig.width = 6, fig.asp = 0.8}
df_all_resp %>%
  ggplot(aes(x = question, y = val_mean, color = taves_ctry, label = val_n)) +
  facet_grid(cols = vars(level1, level2, cluster), 
             rows = vars(taves_ctry), 
             scales = "free_x", space = "free_x") +
  geom_hline(yintercept = 0, lty = 2) +
  geom_pointrange(aes(ymin = val_ci_lower, ymax = val_ci_upper)) +
  geom_text(size = 2, y = -0.95, color = "black") +
  scale_color_brewer(palette = "Set2") +
  ylim(-1, 1) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        legend.position = "none") +
  labs(title = "VALENCE: Mean response to whether experience was considered positive/negative",
       subtitle = "By site (rows), question (x-axis), and levels 1-2 grouping of clusters (columns)\nNumbers for each question indicate the total number of participants from that site who responded to that question; error bars are 95% bootstrapped CIs",
       x = "Question",
       y = "Valence (-1: negative, 0: neutral, +1: positive)", color = "Site")
```

<P style="page-break-before: always">
```{r, fig.width = 6, fig.asp = 0.8}
df_all_resp %>%
  ggplot(aes(x = question, y = sig_mean, color = taves_ctry, label = sig_n)) +
  facet_grid(cols = vars(level1, level2, cluster), 
             rows = vars(taves_ctry), 
             scales = "free_x", space = "free_x") +
  geom_hline(yintercept = 0.5, lty = 2) +
  geom_pointrange(aes(ymin = sig_ci_lower, ymax = sig_ci_upper)) +
  geom_text(size = 2, y = 0.05, color = "black") +
  scale_color_brewer(palette = "Set2") +
  ylim(0, 1) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
        legend.position = "none") +
  labs(title = "SIGNIFICANCE: Mean response to how significant experience was considered",
       subtitle = "By site (rows), question (x-axis), and levels 1-2 grouping of clusters (columns)\nNumbers for each question indicate the total number of participants from that site who responded to that question; error bars are 95% bootstrapped CIs",
       x = "Question",
       y = "Significance (0: not, 0.5: somewhat, 1: very)", color = "Site")
```


